machine-learning, scikit-learn, cross-validation

Difference between cross-validation and test performance


I am using sklearn with cross validation (5-fold).

Cross validation: I take my data set and run 5-fold cross validation on it. The five returned scores are all in the range 0.80 to 0.85.

Direct training: If I use the same data set with a train/test split (test size 0.2) and directly fit and predict, I get around 0.70 accuracy (recall and ROC AUC are also lower).

So, in cross validation, a single combination of folds is equivalent to what we do directly in a train/test split, right? Then why is there such a large difference? I have read that the reason is that cross validation overfits the training data. But when a single fold combination of cross validation is considered, isn't it the same as a direct fit and predict? If I somehow knew exactly how a particular cross-validation fold splits the data, and used that exact split in the direct approach, shouldn't I get the same accuracy?
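For concreteness, here is roughly what I am doing. This is only a minimal sketch: the dataset and estimator below are placeholders standing in for my actual data and model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # stands in for my data
clf = LogisticRegression(max_iter=1000)                      # stands in for my model

# Cross validation: five scores, one per train/validation fold combination
cv_scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(cv_scores)  # in my case these land around 0.80-0.85

# Direct training: a single 80/20 split, then fit and predict
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))  # in my case around 0.70
```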


Solution

  • Without looking at the code and your data, I can only give an educated guess. First of all, the reason we need a validation dataset is to tune hyperparameters. Using cross validation, we try to find the hyperparameters that give the best prediction accuracy on the validation set. The final model, with the hyperparameters chosen this way, therefore overfits the validation dataset, so the prediction accuracy on the validation dataset is not a true measure of your model's performance. You need a held-out test dataset, never touched during tuning, to evaluate your model; a sketch of that workflow follows the list below.

    If you use a train/test split only, without a validation set, the performance on the test dataset can be worse because:

    1. Your hyperparameters are not tuned, since you have no validation dataset.
    2. Your model never sees the test dataset, so it cannot overfit to it.
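    As a rough illustration of that workflow (a sketch with an arbitrary estimator and parameter grid, not your actual setup): tune hyperparameters with cross validation on the training portion only, then report performance once on the untouched test set.

    ```python
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1000, random_state=0)

    # Hold out a test set that is never used during tuning
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )

    # Tune hyperparameters with 5-fold cross validation on the training portion only
    search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)

    print("best CV accuracy:", search.best_score_)                   # optimistic: the folds guided the choice
    print("held-out test accuracy:", search.score(X_test, y_test))   # honest estimate
    ```

    The cross-validation score is typically the more optimistic of the two, because the same folds that produced it were also used to pick the hyperparameters; the held-out test score is the number to report.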