Tags: python, scikit-learn, svm, cross-validation, grid-search

Identifying overfitting in a cross-validated SVM when tuning parameters


I have an RBF SVM that I'm tuning with GridSearchCV. How do I tell whether my good results are actually good, or whether the model is overfitting?


Solution

  • Overfitting is generally associated with high variance, meaning that the model parameters resulting from a fit to some realized data set vary widely from data set to data set. You collect some data, fit a model, and get some parameters; you do it again with new data, and now your parameters are totally different.

    One consequence of this is that, in the presence of overfitting, the training error (the error from re-running the model on the data used to train it) will usually be very low, or at least low in contrast to the test error (the error from running the model on previously unseen test data).

    One diagnostic suggested by Andrew Ng is to set aside some of your data as a test set. Ideally this should be done from the very beginning, so that seeing model results that include this data never has a chance to influence your decisions. But you can also do it after the fact, as long as you say so in your model discussion.

    With the test data, compute the same error or loss score that you compute on the training data. If the training error is very low but the testing error is unacceptably high, you probably have overfitting.
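
    As a concrete illustration, here is a minimal sketch of that workflow in scikit-learn, using synthetic data and an assumed parameter grid (substitute your own data and search ranges):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic data stands in for your real data set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out a test set before any tuning so it cannot influence model choice.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Hypothetical parameter grid; use the ranges you are actually searching.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)

# A large gap between these two scores is a sign of overfitting.
print("training accuracy:", search.score(X_train, y_train))
print("held-out test accuracy:", search.score(X_test, y_test))
```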

    Further, you can vary the size of your test set and generate a diagnostic graph. Say you randomly hold out 5% of your data, then 10%, then 15%, and so on up to 30%. This gives you six data points showing the resulting training error and testing error.

    As you increase the training set size (decrease testing set size), the shape of the two curves can give some insight.

    The test error will decrease and the training error will increase. The two curves should flatten out and converge, with some gap between them.

    If that gap is large, you are likely dealing with overfitting; this suggests using a large training set and trying to collect more data if possible.

    If the gap is small, or if the training error itself is already too large, it suggests that model bias is the problem, and you should consider a different model class altogether.
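
    scikit-learn can produce exactly this kind of diagnostic directly; here is a minimal sketch using learning_curve on synthetic data (the estimator settings and the size grid are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Six training-set sizes, mirroring test sets of 30% down to 5%.
train_sizes, train_scores, test_scores = learning_curve(
    SVC(kernel="rbf", C=1, gamma=0.01), X, y,
    train_sizes=np.linspace(0.70, 0.95, 6), cv=5)

for n, tr, te in zip(train_sizes, train_scores.mean(axis=1),
                     test_scores.mean(axis=1)):
    # A persistent gap between the two scores suggests overfitting;
    # two converged but low scores suggest bias.
    print(f"n={n:4d}  train={tr:.3f}  test={te:.3f}")
```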

    Note that in the above setting you can also substitute k-fold cross-validation for the test-set approach. To generate a similar diagnostic curve, vary the number of folds (and hence the size of the test sets). For a given value of k, each subset is used once for testing while the other k-1 subsets are used for training, and the scores are averaged over all fold assignments. This gives you both a training-error and a testing-error metric for each choice of k. As k becomes larger, the training sets become bigger (for example, if k=10, then training errors are reported on 90% of the data), so again you can see how the scores vary as a function of training set size.

    The downside is that CV scores are already expensive to compute, and repeating CV for many different values of k makes it even worse.
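
    If you take the cross-validation route, a sketch along these lines (with an assumed, fixed set of SVM parameters) would collect both scores for several values of k:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for k in (2, 5, 10):
    # return_train_score=True yields training error alongside test error;
    # larger k means each training fold covers a larger share of the data.
    scores = cross_validate(SVC(kernel="rbf", C=1, gamma=0.01), X, y,
                            cv=k, return_train_score=True)
    print(f"k={k:2d}  train={scores['train_score'].mean():.3f}  "
          f"test={scores['test_score'].mean():.3f}")
```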

    Another possible cause of overfitting is too large a feature space. In that case, you can look at importance scores for each of your features. If you prune out some of the least important features, re-run the overfitting diagnostic above, and observe improvement, that is further evidence that overfitting is the problem, and you may want a simpler feature set or a different model class.
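
    An RBF SVM does not expose feature importances directly, but permutation importance is one way to approximate them; here is a sketch of the pruning idea (the keep-threshold and SVM parameters are assumptions to adapt):

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = SVC(kernel="rbf", C=1, gamma=0.01).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)

# Keep only the features whose shuffling actually hurt the score,
# then refit and repeat the overfitting diagnostic on the reduced set.
keep = result.importances_mean > 0
model_small = SVC(kernel="rbf", C=1, gamma=0.01).fit(X_train[:, keep], y_train)
print("reduced-feature test accuracy:",
      model_small.score(X_test[:, keep], y_test))
```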

    On the other hand, if you still have high bias, it suggests the opposite: your model does not have a rich enough feature space to account for the variability in the data, so you may want to augment the model with even more features.