When evaluating the performance of an SVM, an RF and a DT (max_depth = 3), I am getting far superior results with the RF model. The data being modeled is real-world data. All models are evaluated using stratified cross-validation, since the data set is imbalanced.
For the 4 classes, I am getting the following precision, recall and F1 scores:
Originally, the data set had the following value_counts for the 4 classes:
How could RF be so much better than SVM and DT?
Thanks in advance!
These results are entirely plausible! A Random Forest is a much more powerful model than a single Decision Tree, because it is exactly that: an ensemble of DTs. Ensembles (combinations of several models) are known to generalise well to unseen data. Where a single Decision Tree or an SVM overfits, a Random Forest usually holds up, because internally many DTs, each trained on a bootstrap sample of the data and considering a random subset of features at each split, cast a vote on the prediction. Averaging these decorrelated votes reduces variance. Also note that a DT capped at max_depth = 3 is a very weak learner, so a large gap to an RF of full-depth trees is expected.
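A minimal sketch of this comparison, assuming scikit-learn and using a synthetic imbalanced 4-class dataset as a stand-in for your real data (all names and parameters here are illustrative, not your actual setup):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 4 imbalanced classes (assumed proportions).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=10,
    n_classes=4, weights=[0.6, 0.2, 0.15, 0.05], random_state=0,
)

# Stratified CV keeps the class proportions in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "DT (max_depth=3)": DecisionTreeClassifier(max_depth=3, random_state=0),
    "RF (200 trees)": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM (RBF)": SVC(),
}

for name, model in models.items():
    # Macro-F1 weights all classes equally, which matters when imbalanced.
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
    print(f"{name}: mean macro-F1 = {scores.mean():.3f}")
```

On data like this, the RF typically beats the depth-capped DT by a wide margin, mirroring what you observed.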