python pandas machine-learning data-science

Evaluating test dataset


For most competitions, the data is split into a training set and a test set. I worked on the training data (splitting it into x_train...), created a model, evaluated it, and got an accuracy score. I then used the same model to predict on the test dataset that was left out. However, when I tried to evaluate the model's performance on it, I kept getting the error below. Could anyone explain what I am doing wrong and suggest ways to remedy it?

The Code:

# logistic regression

logreg_main_test = logreg.predict(main_test_scaled) # predict

# evaluate
logreg_score_main_test = accuracy_score(Y_test, logreg_main_test)
f1_val_main_test = f1_score(Y_test, logreg_score_main_test)
recall_val_main_test = recall_score(Y_test, logreg_score_main_test)

# display result
print('Model accuracy:',logreg_score_main_test)

The output error

ValueError                                Traceback (most recent call last)
<ipython-input-51-8265b6fa0a29> in <module>
      4 
      5 # evaluate
----> 6 logreg_score_main_test = accuracy_score(Y_test, logreg_main_test)
      7 f1_val_main_test = f1_score(Y_test, logreg_score_main_test)
      8 recall_val_main_test = recall_score(Y_test, logreg_score_main_test)

2 frames
/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    330     uniques = np.unique(lengths)
    331     if len(uniques) > 1:
--> 332         raise ValueError(
    333             "Found input variables with inconsistent numbers of samples: %r"
    334             % [int(l) for l in lengths]

ValueError: Found input variables with inconsistent numbers of samples: [4705, 10086]

Solution

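  • The number of samples in your true labels does not match the number of samples in your predicted labels. The error lists the lengths in argument order, so Y_test has 4705 entries while logreg_main_test has 10086.
    Check the lengths of Y_test and logreg_main_test; they must match. If they do not, either your split is incorrect or you are predicting on a different split from the one Y_test belongs to (for example, scoring predictions for the left-out test dataset against labels held out from the training data).
    Note also that f1_score and recall_score are given logreg_score_main_test (the accuracy value) rather than logreg_main_test (the predicted labels); both functions need the predictions as their second argument.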
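
The check described above can be sketched as follows. This is a minimal illustration, not the asker's actual pipeline: the synthetic arrays, the names X_labelled, X_main_test, X_val, y_val and the evaluate helper are all assumptions made up for the example. It scores the model only on a held-out split that has labels, and shows that predictions for an unlabelled test set have a different length and cannot be scored against those labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins (assumptions): a labelled training set and an
# unlabelled competition test set with a different number of rows.
rng = np.random.default_rng(0)
X_labelled = rng.normal(size=(200, 4))
y_labelled = (X_labelled[:, 0] + X_labelled[:, 1] > 0).astype(int)
X_main_test = rng.normal(size=(350, 4))  # no labels available for these rows

# Hold out part of the labelled data for evaluation.
X_train, X_val, y_train, y_val = train_test_split(
    X_labelled, y_labelled, test_size=0.25, random_state=0
)

# Fit the scaler on the training split only, then reuse it everywhere.
scaler = StandardScaler().fit(X_train)
logreg = LogisticRegression().fit(scaler.transform(X_train), y_train)


def evaluate(y_true, y_pred):
    """Check lengths first, then score predicted labels against true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError(
            f"length mismatch: {len(y_true)} labels vs {len(y_pred)} predictions"
        )
    # Every metric receives the predicted labels, not a previously computed score.
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }


# Predictions for the held-out split line up with y_val, so they can be scored.
val_pred = logreg.predict(scaler.transform(X_val))
scores = evaluate(y_val, val_pred)
print(scores)

# Predictions for the unlabelled test set have a different length from y_val.
main_test_pred = logreg.predict(scaler.transform(X_main_test))
print(len(y_val), len(main_test_pred))
```

If the left-out test dataset does come with its own labels, pass those labels (same row count, same order) to the metric functions in place of y_val; if it does not, the predictions can only be submitted, not scored locally.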