classification, svm, logistic-regression, random-forest

SVM takes a long time for hyperparameter tuning


I am running SVM, Logistic Regression and Random Forest on the credit card dataset. My training dataset has the shape (454491, 30). I performed 5-fold cross-validation (which took more than an hour) with 'scoring' set to 'f1_weighted' and got the results below:

Logistic Regression Cross Validation F1 score: 0.9501631748525725
Random Forest Cross Validation F1 score: 0.9999383944188953
Support Vector Cross Validation F1 score: 0.9989703035983751

I chose SVM as Random Forest is prone to overfitting and SVM scored better than Logistic Regression.

I wanted to add regularization via hyperparameter tuning. I initially used GridSearchCV, but it was taking a long time, so I switched to RandomizedSearchCV, yet even this is taking a very long time (around 4-5+ hours). According to the data description, the features have already been scaled and PCA has been performed to preserve privacy. I have also used RobustScaler() on the Amount and Time columns, as they were not scaled.
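
For reference, that extra scaling step looked roughly like this (the tiny dataframe below is just a stand-in for the real dataset):

import pandas as pd
from sklearn.preprocessing import RobustScaler

# Toy stand-in for the credit card dataframe; only the unscaled columns are shown.
df = pd.DataFrame({'Time': [0.0, 406.0, 472.0, 4462.0],
                   'Amount': [149.62, 2.69, 378.66, 239.93]})

# RobustScaler centres on the median and scales by the IQR, so it is less
# sensitive to extreme transaction amounts than StandardScaler.
df[['Amount', 'Time']] = RobustScaler().fit_transform(df[['Amount', 'Time']])

The hyperparameter search I set up is below: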

from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

tuned_parameters = {'kernel': ['rbf', 'sigmoid'],
                    'gamma': [1e-2, 1e-3, 1e-4, 1e-5],
                    'C': [0.001, 0.10, 0.0001, 0.00001]}

tuned_SVM = RandomizedSearchCV(SVC(), tuned_parameters, cv=3, scoring='f1_weighted',
                               random_state=12, verbose=15, n_jobs=-1)

Any suggestions on how to proceed?


Solution

  • First of all; the idea of Random Forest is to reduce overfitting. It is correct that a single Decision Tree is (very often) heavily overfit; that is why we create this ensemble, to reduce the variance while still keeping the bias low. Note that the variance of the Random Forest (its overfitting) can depend heavily on your parameters, e.g. if you set max_features=n_features, where n_features is the number of features, you will just create multiple copies of the same decision tree, which will indeed lead to overfitting.

    My point here is that Random Forest can be a strong tool: a single decision tree overfits heavily, while a Random Forest (usually) should not. A small comparison of max_features settings is sketched below.
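
    The sketch uses synthetic data; n_estimators and the other values are illustrative, not taken from the question:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

# max_features=None means "use all features at every split", so the bootstrapped
# trees become near-copies of each other; 'sqrt' decorrelates them.
for max_features in [None, 'sqrt']:
    rf = RandomForestClassifier(n_estimators=100, max_features=max_features, random_state=0)
    print(max_features, cross_val_score(rf, X, y, cv=5, scoring='f1_weighted').mean())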

    Secondly; if I recall correctly, the training time of an SVM is roughly O(n^2), where n is the number of training points, i.e. when you have a lot of training data it can take a long time to fit, so grid-searching over the parameters can take a long (!) time. The rough timing sketch below shows how quickly the fit time grows with the training-set size.
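
    It times SVC.fit on growing subsets of a synthetic dataset (the data and sizes are illustrative):

import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=20000, n_features=30, random_state=0)

# Doubling the training size should much more than double the fit time.
for n in [2500, 5000, 10000, 20000]:
    start = time.perf_counter()
    SVC(kernel='rbf').fit(X[:n], y[:n])
    print(n, round(time.perf_counter() - start, 2), 'seconds')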

    Third; regarding regularization: if you got a ~0.99 validation score using a kernel (let us assume it was "rbf"), I would just run with the "rbf" kernel and tune only the C parameter, due to the training time of SVM (see the sketch below). Also, isn't 0.998 good enough?
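
    A sketch of that narrower search; the C grid here is illustrative:

from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Fixing the kernel shrinks the search space, so far fewer expensive fits are needed.
search = RandomizedSearchCV(SVC(kernel='rbf'),
                            {'C': [0.01, 0.1, 1, 10, 100]},
                            n_iter=5, cv=3, scoring='f1_weighted',
                            random_state=12, n_jobs=-1)
# search.fit(X_train, y_train)  # X_train, y_train: your scaled training data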

    Decision Trees can be regularized using "cost complexity pruning", where, just as in logistic regression, you penalize the scoring function with the complexity (the depth of the tree). I don't think sklearn's Random Forest has that option, but its Decision Trees do, so you could build your own Random Forest from those; however, since the Random Forest already has the max_depth parameter, I don't think that is worth it. A pruning sketch on a single tree follows below.
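
    Here is what cost complexity pruning looks like on a single sklearn DecisionTreeClassifier (synthetic data; the way the alphas are subsampled is illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate alphas come from the pruning path of the fully grown tree;
# a larger ccp_alpha prunes the tree more aggressively.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
for alpha in path.ccp_alphas[::10]:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(round(alpha, 5), round(tree.score(X_test, y_test), 3))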

    Fourth; consider Bayesian optimization for the hyperparameter search; a sketch follows below.
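
    This assumes the optional scikit-optimize package (skopt) is installed; the search space below is illustrative:

from skopt import BayesSearchCV
from skopt.space import Real
from sklearn.svm import SVC

# BayesSearchCV spends each new fit where the surrogate model expects improvement,
# so it typically needs fewer fits than a random or exhaustive search.
opt = BayesSearchCV(SVC(kernel='rbf'),
                    {'C': Real(1e-3, 1e3, prior='log-uniform'),
                     'gamma': Real(1e-5, 1e-1, prior='log-uniform')},
                    n_iter=20, cv=3, scoring='f1_weighted',
                    random_state=12, n_jobs=-1)
# opt.fit(X_train, y_train)  # X_train, y_train: your training data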

    Why do you want to use SVM when Random Forest is (according to your results) performing better and is faster to train?