Tags: python, machine-learning, scikit-learn, cross-validation, train-test-split

Big difference in score (10%) between a train_test_split and a cross-validation


I'm working on a classification problem with:

  • 2,500 rows
  • 25,000 columns
  • 88 different classes, unevenly distributed

Then something very strange happens:

When I run a dozen different train/test splits, I always get scores around 60%.

And when I run cross-validations, I always get scores around 50%. Moreover, it has nothing to do with the unequal class distribution: when I pass stratify=y to train_test_split I stay around 60%, and when I use a StratifiedKFold I stay around 50%.

Which score should I trust? And why the difference? To me, a cross-validation is just a succession of train/test splits with different splits each time, so nothing seems to justify such a difference in score.
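(For reference, here is a minimal, self-contained sketch of the kind of comparison described above. The real data and model are not shown in the question, so this uses a hypothetical RandomForestClassifier, inferred from the forest variable in the answer below, and a synthetic dataset deliberately sorted by class to mimic an ordered dataset. The sorting here is extreme, so the gap is much larger than 10%, but the mechanism is the same.)

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import KFold, cross_val_score, train_test_split

    # Synthetic stand-in for the data: rows sorted by class to mimic an
    # ordered dataset (assumption; the original data is not shown).
    X, y = make_classification(n_samples=2500, n_features=100, n_informative=50,
                               n_classes=10, random_state=0)
    order = np.argsort(y)
    X, y = X[order], y[order]

    forest = RandomForestClassifier(random_state=0)

    # A single train/test split: train_test_split shuffles by default.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    print("TTS score:", forest.fit(X_train, y_train).score(X_test, y_test))

    # Cross-validation with a plain KFold: no shuffling, so each test fold is
    # a contiguous block of rows (here, whole classes absent from training).
    print("CV scores:", cross_val_score(forest, X, y, cv=KFold(n_splits=5)))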


Solution

  • Short answer: add shuffle=True to your KFold: cross_val_score(forest, X, y, cv=KFold(shuffle=True))

    Long answer: the difference between a succession of train_test_split calls and a cross-validation with a plain KFold is that train_test_split shuffles the rows before splitting them into train and test sets, while KFold does not by default. The difference in score is likely due to your dataset being ordered in some biased way (for example, sorted by class or by date), so the unshuffled, contiguous folds are not representative. Just add shuffle=True to your KFold (or your StratifiedKFold) and that's all you need to do.
    Long answer: the difference between a succession of TrainTestSplit and a cross-validation with a classic KFold is that there is a mix in the TTS before the split between the train and the test set. The difference in score may be due to the fact that your dataset is sorted in a biased way. So just add shuffle=True to your KFold (or your StratifiedKFold and that's all you need to do).