How should I set my gamma and Cost parameters in libSVM when I am using an imbalanced dataset that consists of 75% 'true' labels and 25% 'false' labels? Because of the imbalance, every predicted label comes out as 'True'.
If the issue isn't with libSVM but with my dataset, how should I handle this imbalance from a theoretical machine learning standpoint? Note: I'm using between 4 and 10 features and have a small set of 250 data points.
Class imbalance has nothing to do with the selection of C and gamma. To deal with this issue you should use a class weighting scheme, which is available in, for example, the scikit-learn package (built on libsvm).
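As a rough illustration of class weighting, here is a minimal sketch using scikit-learn's SVC with class_weight="balanced". The data is synthetic and made up purely for the example (250 points, 5 features, about 75% of one class); the kernel, C, and gamma values are placeholders, not recommendations.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the data in the question (assumption: values are
# invented): 188 "true" (label 1) and 62 "false" (label 0) points.
n_true, n_false = 188, 62
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(n_true, 5)),
    rng.normal(loc=1.0, scale=1.0, size=(n_false, 5)),
])
y = np.array([1] * n_true + [0] * n_false)

# class_weight="balanced" multiplies C for each class by
# n_samples / (n_classes * class_count), so mistakes on the rare
# class are penalised more heavily than mistakes on the common one.
clf = SVC(kernel="rbf", C=1.0, gamma="scale", class_weight="balanced")
clf.fit(X, y)

# Per-class multipliers of C that were actually used during fitting.
weights = {int(c): float(w) for c, w in zip(clf.classes_, clf.class_weight_)}
print(weights)
```

An explicit dictionary such as class_weight={0: 3.0, 1: 1.0} can be passed instead if you want to control the ratio yourself.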
Selection of the best C and gamma is performed using grid search with cross-validation. You should try a wide range of values here: for C it is reasonable to choose values between 1 and 10^15, while a simple and good heuristic for the range of gamma is to compute the pairwise distances between all your data points and select gamma according to the percentiles of this distribution. Think of placing a Gaussian distribution with variance equal to 1/gamma at each point: if you select gamma such that this distribution overlaps with many points, you will get a very "smooth" model, while using a small variance leads to overfitting.
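The percentile heuristic for gamma can be sketched with the standard library alone. This is one possible reading of it, under stated assumptions: the points are synthetic, the deciles are an arbitrary choice of percentiles, and each percentile distance d is treated as a standard deviation, so that variance = d^2 = 1/gamma.

```python
import itertools
import math
import random
import statistics

random.seed(0)

# Synthetic stand-in for the real feature matrix (assumption: invented data),
# 250 points with 5 features each.
points = [[random.gauss(0.0, 1.0) for _ in range(5)] for _ in range(250)]

# Euclidean distance between every unordered pair of points.
dists = [math.dist(a, b) for a, b in itertools.combinations(points, 2)]

# The nine deciles (10th, 20th, ..., 90th percentile) of the distances.
deciles = statistics.quantiles(dists, n=10)

# Variance 1/gamma equal to d**2 gives gamma = 1 / d**2.
# Small percentile -> narrow Gaussian -> large gamma (risk of overfitting);
# large percentile -> wide Gaussian -> small gamma (smoother model).
gammas = [1.0 / d ** 2 for d in deciles]

for d, g in zip(deciles, gammas):
    print(f"distance={d:.3f}  gamma={g:.4f}")
```

The resulting list would then serve as the gamma axis of the grid search, combined with the candidate values of C and scored by cross-validation.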