I am using a random forest classifier to classify a dataset that has two classes.
The classifier produces an out-of-bag error of 21.52%. The per-class error for the first class, which makes up 75% of the training data, is 0.0059, while the error for the second class is very high: 0.965.
I am looking for an explanation of this behaviour, and for any suggestions to improve the accuracy on the second class.
I am looking forward to your help.
Thanks
I forgot to say that I'm using R and that I used a nodesize of 1000 in the test above.
Here I repeated the training with only 10 trees and nodesize = 1 (just to give an idea); below are the function call in R and the confusion matrix:
randomForest(formula = Label ~ ., data = chData30PixG12, ntree = 10,
             importance = TRUE, nodesize = 1, keep.forest = FALSE,
             do.trace = 50)
Type of random forest: classification
Number of trees: 10
No. of variables tried at each split: 22
OOB estimate of error rate: 24.46%
Confusion matrix:
           Irrelevant Relevant class.error
Irrelevant      37954     4510   0.1062076
Relevant         8775     3068   0.7409440
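(To make the class.error column concrete: it is the off-diagonal fraction of each row of the matrix, e.g.)

4510 / (37954 + 4510)   # Irrelevant: 0.1062076
8775 / (8775 + 3068)    # Relevant:   0.7409440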
I agree with @usr that, generally speaking, when you see a Random Forest classify (nearly) every observation as the majority class, it means your features don't provide enough information to distinguish the two classes.
One option is to run the Random Forest so that it over-samples observations from the minority class (rather than sampling with replacement from the entire data set). For example, you might specify that each tree is built on a sample of size N where you force N/2 of the observations to come from each class (or some other ratio of your choosing).
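In R's randomForest this can be done with the strata and sampsize arguments. Here is a minimal sketch, reusing the data and label names from the question; ntree = 500 and the equal per-class cap are illustrative choices, not something from the original post:

library(randomForest)

# Draw an equal number of observations from each class for every tree,
# capped at the size of the minority class.
n_min <- min(table(chData30PixG12$Label))

rf_bal <- randomForest(
    Label ~ ., data = chData30PixG12,
    ntree      = 500,                    # illustrative; use more than 10
    strata     = chData30PixG12$Label,   # stratify the bootstrap by class
    sampsize   = c(n_min, n_min),        # one entry per stratum
    importance = TRUE
)
rf_bal$confusion                         # check the per-class errors again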
While that might help some, it is by no means a cure-all. It may well be that you'll get more mileage out of finding better features that do a good job of distinguishing the classes than out of tweaking the RF settings.
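If you do go the feature route, the importance measures your call already computes (importance = TRUE) are a natural place to start; a quick sketch, assuming the fitted forest is stored in a hypothetical object rf:

# Rank features by how much permuting them hurts OOB accuracy.
importance(rf, type = 1)   # mean decrease in accuracy, per feature
varImpPlot(rf)             # visual ranking of the same information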