machine-learning naive-bayes

Accuracy is higher with cross-validation and lower without


I have a question regarding cross-validation: I'm using a Naive Bayes classifier to classify blog posts by author. When I validate my dataset without k-fold cross-validation I get an accuracy score of 0.6, but when I use k-fold cross-validation, each fold yields a much higher accuracy (greater than 0.8).

For example:

(splitting manually): Validation Set Size: 1452, Training Set Size: 13063, Accuracy: 0.6033057851239669

and then

(with k-fold): Fold 0 -> Training Set Size: 13063, Validation Set Size: 1452, Accuracy: 0.8039702233250621 (all folds are over 0.8)

etc...

Why does this happen?


Solution

  • There are a few reasons this could happen:

    1. Your "manual" split is not random, and it happens to contain more outliers that are hard to predict. How are you doing this split?

    2. What is the k in your k-fold CV? I'm not sure what you mean by "Validation Set Size"; in k-fold CV there is a fold size, not a separate validation set, because cross-validation runs over your entire dataset. Are you sure you're running k-fold cross-validation correctly?

    Usually, one picks k = 10 for k-fold cross-validation. If you run it correctly on your entire dataset, you should trust its results over the single manual split.
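To illustrate both points, here is a minimal sketch using scikit-learn. It assumes a setup like the question's (a Naive Bayes classifier); the synthetic features stand in for the vectorized blog posts, which are not shown in the question. It contrasts a properly shuffled manual split with 10-fold cross-validation over the whole dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the blog-post features; the real pipeline
# presumably vectorizes text, but the split/CV mechanics are the same.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# 1) A *random*, shuffled manual split. If a manual split is not
#    shuffled (e.g. simply the last 10% of rows), it can land on
#    atypical, hard-to-predict samples and depress accuracy.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.1,
                                            random_state=0)
manual_acc = GaussianNB().fit(X_tr, y_tr).score(X_val, y_val)

# 2) 10-fold CV over the *entire* dataset: every sample is held out
#    for validation exactly once; report the mean across folds.
cv_scores = cross_val_score(GaussianNB(), X, y,
                            cv=KFold(n_splits=10, shuffle=True,
                                     random_state=0))

print(f"manual split accuracy:    {manual_acc:.3f}")
print(f"10-fold CV mean accuracy: {cv_scores.mean():.3f}")
```

With both splits shuffled, the two numbers should be close; a large gap like 0.6 vs 0.8 usually points to a non-random manual split or a bug in how the folds are built.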