r, machine-learning, classification, random-forest

RF: high OOB accuracy for one class and very low accuracy for the other, with a large class imbalance


I am using a random forest classifier to classify a dataset that has two classes.

  • The number of features is 512.
  • The class ratio is roughly 3:1, i.e. 75% of the data is from the first class and 25% from the second one.
  • I am using 500 trees.

The classifier produces an out-of-bag (OOB) error of 21.52%. The per-class error for the first class (which makes up 75% of the training data) is 0.0059, while the error for the second class is very high: 0.965.

I am looking for an explanation of this behaviour, and for any suggestions on how to improve the accuracy for the second class.

I look forward to your help.

Thanks

I forgot to say that I'm using R and that I used a nodesize of 1000 in the test above.

Here I repeated the training with only 10 trees and nodesize = 1 (just to give an idea). Below are the function call in R and the resulting output, including the confusion matrix:

  • randomForest(formula = Label ~ ., data = chData30PixG12, ntree = 10, importance = TRUE, nodesize = 1, keep.forest = FALSE, do.trace = 50)

  • Type of random forest: classification

  • Number of trees: 10

  • No. of variables tried at each split: 22

  • OOB estimate of error rate: 24.46%

  • Confusion matrix:

                Irrelevant  Relevant  class.error
    Irrelevant       37954      4510    0.1062076
    Relevant          8775      3068    0.7409440
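
To make the numbers in the confusion matrix above easier to interpret, here is a minimal sketch in base R that recomputes the per-class errors and the overall OOB error from the four counts, and compares them with the error of a rule that always predicts the majority class. The object names (`cm`, `class_error`, `oob_error`, `baseline_error`) are illustrative and not part of the original post.

```r
# Confusion matrix copied from the output above
# (rows = true class, columns = predicted class).
cm <- matrix(c(37954, 4510,
                8775, 3068),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("Irrelevant", "Relevant"),
                             predicted = c("Irrelevant", "Relevant")))

# Per-class error: misclassified observations in a row divided by the row total.
class_error <- 1 - diag(cm) / rowSums(cm)

# Overall OOB error: all misclassified observations over all observations.
oob_error <- 1 - sum(diag(cm)) / sum(cm)

# Baseline: error of a rule that always predicts the majority class.
baseline_error <- 1 - max(rowSums(cm)) / sum(cm)

print(round(class_error, 7))
print(round(100 * c(oob = oob_error, baseline = baseline_error), 2))
```

On these counts the always-"Irrelevant" baseline has an error of about 21.8%, so the 24.46% OOB error of the 10-tree forest is worse than the baseline. If the 500-tree run used the same data, its 21.52% is only marginally better. The overall error rate is dominated by the majority class, which is why it can look acceptable while the minority class is almost entirely misclassified.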

Solution

  • I agree with @usr that generally speaking when you see a Random Forest simply classifying (nearly) each observation as the majority class, this means that your features don't provide much information to distinguish the two classes.

    One option is to run the Random Forest such that you over-sample observations from the minority class (rather than sampling with replacement from the entire data set). So you might specify that each tree is built on a sample of size N where you force N/2 of the observations to come from each class (or some other ratio of your choosing).

    While that might help somewhat, it is by no means a cure-all. You are likely to get more mileage out of finding better features that do a good job of distinguishing the classes than out of tweaking the RF settings.
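
The balanced per-tree sampling described above can be sketched with the `strata` and `sampsize` arguments of `randomForest`. This is a minimal illustration under stated assumptions, not a tuned solution: the data frame `toy` is synthetic stand-in data (two weakly informative features, a 3:1 class ratio), and `n_per_class`, `rf_balanced`, and `ntree = 200` are arbitrary choices made for the example.

```r
library(randomForest)

set.seed(42)

# Synthetic stand-in for the real data (assumption: NOT the asker's data set).
n_major <- 750
n_minor <- 250
toy <- data.frame(
  x1    = c(rnorm(n_major, mean = 0), rnorm(n_minor, mean = 1)),
  x2    = c(rnorm(n_major, mean = 0), rnorm(n_minor, mean = 1)),
  Label = factor(rep(c("Irrelevant", "Relevant"), c(n_major, n_minor)))
)

# Draw the same number of rows from each class for every tree.
# Here that number is the size of the smaller class.
n_per_class <- min(table(toy$Label))

rf_balanced <- randomForest(
  Label ~ ., data = toy,
  ntree    = 200,
  strata   = toy$Label,                     # stratify the per-tree sample by class
  sampsize = c(n_per_class, n_per_class)    # one entry per class, in factor-level order
)

# OOB confusion matrix, including the per-class error column.
print(rf_balanced$confusion)
```

Changing the two entries of `sampsize` gives "some other ratio of your choosing". Balanced sampling typically lowers the minority-class error at the cost of a higher majority-class error, so it is worth comparing the whole confusion matrix rather than only the overall OOB error.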