machine-learning · scikit-learn · random-forest

Why did I get a much higher score from cross_val_score() than on the actual test?


I've been using a random forest in sklearn to predict on a dataset, and the following code prints the scores:

print(np.mean(cross_val_score(rf, X_train_resampled,
      y_train_resampled, cv=5, scoring='accuracy')))
print(balanced_accuracy_score(y_valid, predictions))

However, cross_val_score gives 0.93 accuracy (which is obviously much higher than the actual test), while balanced_accuracy_score gives only 0.40.

I've asked the new Bing and searched Stack Overflow but found no good enough answers. Is this a problem that occurs when the model is not good enough, or have I done something wrong?


Solution

  • Yes, your model is not good. It was able to cheat because of the class imbalance. Note also that your two numbers measure different things: the cross-validation runs on the resampled (balanced) training data with plain accuracy, while the 0.40 is balanced accuracy on the original, imbalanced validation set, so they are not directly comparable.

    For example, I created a dataset where 95% of samples are class 1 and 5% are class 0. If you evaluate a dummy model (which always returns 1) on this dataset, you get:

    from sklearn.metrics import accuracy_score, balanced_accuracy_score
    import numpy as np

    # 100 samples: 95% class 1, 5% class 0
    data = np.random.randn(100, 10)
    labels = np.array(95 * [1] + 5 * [0])

    class DummyModel:
        """A 'model' that always predicts class 1."""
        def predict(self, x):
            return np.ones(x.shape[0])

    dummy_model = DummyModel()
    print(accuracy_score(labels, dummy_model.predict(data)))           # 0.95
    print(balanced_accuracy_score(labels, dummy_model.predict(data)))  # 0.5
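    The same effect can be reproduced with scikit-learn's built-in DummyClassifier, and the gap disappears once you cross-validate with the same metric you use on the validation set. A minimal sketch (the toy dataset shapes are illustrative):

    ```python
    import numpy as np
    from sklearn.dummy import DummyClassifier
    from sklearn.model_selection import cross_val_score

    # Same imbalanced toy dataset: 95% class 1, 5% class 0
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    y = np.array(95 * [1] + 5 * [0])

    # A majority-class baseline, equivalent to the hand-rolled model above
    dummy = DummyClassifier(strategy="most_frequent")

    # Plain accuracy rewards always predicting the majority class...
    acc = cross_val_score(dummy, X, y, cv=5, scoring="accuracy").mean()
    # ...while balanced accuracy exposes the cheat
    bal = cross_val_score(dummy, X, y, cv=5, scoring="balanced_accuracy").mean()
    print(acc)  # 0.95
    print(bal)  # 0.5
    ```

    In your own code, passing scoring='balanced_accuracy' to cross_val_score would make the cross-validation score directly comparable to the balanced_accuracy_score you compute on the validation set.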