machine-learning · scikit-learn · random-forest

Why did I get a much higher score from cross_val_score() than on the actual test?


I've been using a random forest in sklearn to predict on a dataset, and the following code prints the scores:

print(np.mean(cross_val_score(rf, X_train_resampled,
      y_train_resampled, cv=5, scoring='accuracy')))
print(balanced_accuracy_score(y_valid, predictions))

However, cross_val_score gives 0.93 accuracy (which is obviously much higher than the actual test), while balanced_accuracy_score gives only 0.40.

I've asked the new Bing and searched Stack Overflow but found no good enough answers. Is this a problem that occurs when the model is not good enough, or have I done something wrong?


Solution

  • Yes, your model is not good. It was able to cheat because of the class imbalance. Note also that your two numbers measure different things: the cross-validation runs on the resampled (balanced) training data with plain accuracy, while the 0.40 is balanced accuracy on the original, imbalanced validation set, so they are not directly comparable.

    For example, I created a dataset where 95% of samples are class 1 and 5% are class 0. If you evaluate a dummy model (which always returns 1) on this dataset, you get:

    from sklearn.metrics import accuracy_score, balanced_accuracy_score
    import numpy as np

    # 100 samples: 95% class 1, 5% class 0
    data = np.random.randn(100, 10)
    labels = np.array(95 * [1] + 5 * [0])

    class DummyModel:
        """A 'model' that always predicts class 1."""
        def predict(self, x):
            return np.ones(x.shape[0])

    dummy_model = DummyModel()
    print(accuracy_score(labels, dummy_model.predict(data)))           # 0.95
    print(balanced_accuracy_score(labels, dummy_model.predict(data)))  # 0.5
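    The same effect can be reproduced with scikit-learn's built-in DummyClassifier, and the gap disappears once you cross-validate with the same metric you use on the validation set. A minimal sketch (the toy dataset shapes are illustrative):

    ```python
    import numpy as np
    from sklearn.dummy import DummyClassifier
    from sklearn.model_selection import cross_val_score

    # Same imbalanced toy dataset: 95% class 1, 5% class 0
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    y = np.array(95 * [1] + 5 * [0])

    # A majority-class baseline, equivalent to the hand-rolled model above
    dummy = DummyClassifier(strategy="most_frequent")

    # Plain accuracy rewards always predicting the majority class...
    acc = cross_val_score(dummy, X, y, cv=5, scoring="accuracy").mean()
    # ...while balanced accuracy exposes the cheat
    bal = cross_val_score(dummy, X, y, cv=5, scoring="balanced_accuracy").mean()
    print(acc)  # 0.95
    print(bal)  # 0.5
    ```

    In your own code, passing scoring='balanced_accuracy' to cross_val_score would make the cross-validation score directly comparable to the balanced_accuracy_score you compute on the validation set.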