I have a question regarding cross validation: I'm using a Naive Bayes classifier to classify blog posts by author. When I validate on a single, manually held-out set (no k-fold cross validation) I get an accuracy of about 0.6, but when I do k-fold cross validation, every fold yields a much higher accuracy (greater than 0.8).
For example:

(splitting manually): Validation Set Size: 1452, Training Set Size: 13063, Accuracy: 0.6033057851239669

and then

(with k-fold): Fold 0 -> Training Set Size: 13063, Validation Set Size: 1452, Accuracy: 0.8039702233250621 (all folds are over 0.8)

etc...
Why does this happen?
There are a few reasons this could happen:
Your "manual" split is not random, and you happen to select more outliers that are hard to predict. How are you doing this split?
What is the k in k-fold CV? I'm also not sure what you mean by "Validation Set Size": in k-fold CV you have a fold size, not a separate validation set, because the cross validation is run over your entire dataset. Are you sure you're running k-fold cross validation correctly?
Usually, one picks k = 10 for k-fold cross validation. If you run it correctly over your entire dataset, you should rely on its results rather than on the single manual split.
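As a rough sketch of what "running it correctly over the entire dataset" looks like, the snippet below does 10-fold cross validation with shuffled, stratified folds. Again this assumes scikit-learn, tf-idf features, and `MultinomialNB`, with `posts` / `authors` as placeholders for the full dataset (roughly the 14,515 documents implied by the sizes in the question).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# posts / authors: the full dataset (placeholders)
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())

# 10 folds over the entire dataset; shuffling the fold assignment avoids any
# ordering effects (e.g. posts grouped by author) biasing individual folds
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(pipe, posts, authors, cv=cv, scoring="accuracy")

print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```

The mean of the per-fold accuracies is the number to report; if that mean and a properly randomized hold-out score still disagree substantially, look for a difference in how the two evaluations are set up (e.g. preprocessing fit on different data, or ordering in the dataset).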