machine-learning text scikit-learn nlp train-test-split

How to fix 'ValueError: Found input variables with inconsistent numbers of samples: [32979, 21602]'?


I am building a logistic regression model for sentiment analysis. When I try to split my dataset into x and y training and validation sets, I get this error: ValueError: Found input variables with inconsistent numbers of samples: [32979, 21602]

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# splitting data into training and validation sets
xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(train_bow, train['label'], test_size=0.3, random_state=42)
lreg = LogisticRegression()
lreg.fit(xtrain_bow, ytrain)  # training the model
prediction = lreg.predict_proba(xvalid_bow)  # predicting on the validation set
prediction_int = prediction[:, 1] >= 0.3  # 1 if the positive-class probability is >= 0.3, else 0
prediction_int = prediction_int.astype(int)  # plain int: the np.int alias was removed in NumPy 1.24
f1_score(yvalid, prediction_int)  # calculating F1 score for the validation set

I saw in some posts that this can happen because of the shapes of X and y, so I printed out the shapes of the dataset. I have split my dataset into 85% for training and the rest for test/validation purposes.

# Extracting train and test BoW features
split_frac = 0.85

split_num = int(len(combi['tidy_tweet']) * split_frac)

train_bow = bow[:split_num,:] 
test_bow = bow[split_num:,:] 
print(train_bow.shape)
print(test_bow.shape)
print(train['label'].shape)

(32979, 1000)
(5820, 1000)
(21602,)

The error is raised on this line:

----> 1 xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(train_bow, train['label'], test_size=0.3, random_state=42)
      2 lreg = LogisticRegression() # training the model
      3 lreg.fit(xtrain_bow, ytrain)

I cannot work out what is actually causing the problem. Can you please help? Thanks in advance.


Solution

  • You are getting this error because the second argument to train_test_split(), i.e. the labels, has length 21602, while the first argument has length 32979. The X and y data must contain the same number of samples. So, check the lengths of train_bow and train['label'].

    So, just change

    xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(train_bow, train['label'], test_size=0.3, random_state=42)

    to something like this:

    xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(bow[:split_num,:-1], bow[:split_num,-1], test_size=0.3, random_state=42)
    

    (This assumes bow contains both features and labels, with the labels in the last column.)

    For more detail, see the documentation for sklearn.model_selection.train_test_split.
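
If bow holds only features (for example, the output of a vectorizer) and the labels live in train['label'], another way to make the lengths agree is to slice the features by the number of labelled rows rather than by a fraction of the combined data. The printed shapes are consistent with this reading: 32979 + 5820 = 38799, and int(38799 * 0.85) = 32979, so split_num appears to come from the combined frame, while only 21602 rows have labels. The sketch below is not the asker's data: it uses small synthetic stand-ins (the sizes, the random values, and the assumption that the labelled rows come first in bow are all hypothetical) to reproduce the error and then show the slice that avoids it.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-ins (hypothetical sizes, far smaller than the real data).
# `bow` covers labelled + unlabelled rows; `labels` covers only the labelled ones.
# Assumption: the labelled rows come first in `bow`.
n_labelled, n_unlabelled, n_features = 60, 40, 5
rng = np.random.default_rng(0)
bow = csr_matrix(rng.integers(0, 3, size=(n_labelled + n_unlabelled, n_features)))
labels = rng.integers(0, 2, size=n_labelled)

# Reproduce the mistake: slice by a fraction of the combined length,
# which yields more feature rows than there are labels.
split_num = int(bow.shape[0] * 0.85)
try:
    train_test_split(bow[:split_num], labels, test_size=0.3, random_state=42)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Fix: slice the features by the number of labels so X and y line up.
train_bow = bow[:len(labels)]
xtrain, xvalid, ytrain, yvalid = train_test_split(
    train_bow, labels, test_size=0.3, random_state=42
)
print(xtrain.shape, xvalid.shape, ytrain.shape, yvalid.shape)
```

In the question's terms this would correspond to something like train_bow = bow[:len(train), :], but whether that is right depends on how combi was built, so it is worth confirming that the first len(train) rows of bow really are the labelled ones.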