Tags: python, machine-learning, scikit-learn, svm

Random Forest performing much better than other methods


When evaluating the performance of an SVM, a Random Forest (RF) and a Decision Tree (DT, max_depth = 3), I am getting far superior results with the RF model. The data being modeled is real-world data. All models are evaluated using stratified cross-validation, since the data set is imbalanced.

For the 4 classes, I am getting the precision, recall and F1 scores shown in the attached screenshots.

Originally, the data set contained the following value counts for the 4 classes:

  1. Feeding faults- (Diff. P-set/P-actual): 4 098 data samples
  2. Feeding faults- (Feeding safety circuit faulty): 383 data samples
  3. Generator heating: 228 668 data samples
  4. Other: 51 966 851 samples
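To make the comparison concrete, here is a minimal sketch of such a stratified evaluation. The data is synthetic (standing in for the real fault logs, with a similar but scaled-down imbalance), and the model settings besides `max_depth=3` are illustrative assumptions, not the poster's exact configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real fault data: 4 classes, heavily imbalanced.
X, y = make_classification(
    n_samples=5000, n_classes=4, n_informative=6,
    weights=[0.02, 0.01, 0.07, 0.90], random_state=0,
)

models = {
    "DT (max_depth=3)": DecisionTreeClassifier(max_depth=3, random_state=0),
    "SVM": SVC(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
}

# Stratified folds preserve the class ratios in every split, which
# matters for the rare fault classes.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
    print(f"{name}: macro-F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Macro-averaged F1 is used here because it weights the rare classes equally with the dominant "Other" class; plain accuracy would be misleading at this level of imbalance.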

How could RF be so much better than SVM and DT?

Thanks in advance!

[screenshots of the per-class precision, recall and F1 scores omitted]


Solution

  • These results are entirely plausible. A Random Forest is a much more powerful model than a single Decision Tree because it is an ensemble of DTs. Ensembles (combinations of several models) are notoriously strong in Machine Learning when it comes to generalisation to unseen data. Where a single Decision Tree or an SVM overfits, a Random Forest usually performs relatively well, because internally many DTs, each trained on a different bootstrap sample of the data and considering a random subset of features at each split, cast a vote on the result.
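The voting mechanism described above can be made visible by inspecting the fitted trees inside a `RandomForestClassifier`. This sketch uses synthetic binary data and illustrative hyperparameters; note that scikit-learn's forest actually averages class probabilities (soft voting), while the hard majority vote below is a close approximation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Each fitted tree in the ensemble casts a "vote" for a class label;
# the forest aggregates them (here: a hard majority vote over 0/1 labels).
tree_votes = np.stack([tree.predict(X_te) for tree in rf.estimators_])
majority = (tree_votes.mean(axis=0) > 0.5).astype(int)

# Individual trees vary a lot; the aggregated vote is more stable.
tree_accs = [(v == y_te).mean() for v in tree_votes]
print(f"mean single-tree accuracy: {np.mean(tree_accs):.3f}")
print(f"majority-vote accuracy:    {(majority == y_te).mean():.3f}")
```

Because each tree sees a different bootstrap sample and random feature subsets, the trees make partly independent errors, and averaging their votes cancels much of that variance. That is exactly why the forest generalises better than any single deep tree.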