r regression sampling random-forest cross-validation

Regression with random forest on imbalanced data

I'm using the R randomForest package to predict the distance between pairs of proteins from their amino acid sequences. My main interest is in the pairs that are close (have a small distance). My training dataset consists of 10k protein pairs and the actual distance between each pair. However, very few pairs (less than 0.2%) have a small distance, and the trained random forest has become very accurate at predicting large distances and very bad at predicting small ones. I tried down-sampling the pairs with large distances in my training data, but the results are still not good. There is also a clear sign of over-fitting: my training accuracy is 78% and my testing accuracy is 51%. Any suggestions are highly appreciated.

Solution

A few suggestions:

1) Look at GBMs (gradient boosted models) from the gbm package.

2) Create more features to help the RF understand what drives distance.

3) Plot errors vs. individual variables to see what is driving the relationships. (ggplot2 is great for this, especially with the colour and size options.)

4) You could also turn the y-variable into two classes based on distance (i.e. if distance < x, set it to 1; if distance >= x, set it to 0). Once you have two classes, you can use the strata argument of randomForest() (together with sampsize) to draw balanced samples, and see which variables drive the difference in distance using the importance() and varImpPlot() functions of the randomForest package.

5) Try using the log of distance-related variables. RF is usually pretty good at compensating for non-linearity, but it can't hurt to try.
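As a rough illustration of suggestion 4, here is a minimal sketch on synthetic data. The features x1 to x3, the 5% cutoff, and the number of trees are all made-up assumptions for the example, not values from the question; only strata, sampsize, importance() and varImpPlot() are actual randomForest names.

```r
library(randomForest)

set.seed(42)

# Synthetic stand-in for the protein-pair data (x1..x3 are hypothetical features).
n  <- 2000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$distance <- exp(1 + 0.8 * df$x1 + rnorm(n, sd = 0.5))

# Label the closest pairs as class 1 (the 5% quantile is an arbitrary cutoff).
cutoff   <- quantile(df$distance, 0.05)
df$close <- factor(ifelse(df$distance < cutoff, 1, 0), levels = c(0, 1))

# Draw the same number of cases from each class for every tree.
n_min <- min(table(df$close))
rf <- randomForest(
  x          = df[, c("x1", "x2", "x3")],
  y          = df$close,
  strata     = df$close,
  sampsize   = c(n_min, n_min),
  ntree      = 200,
  importance = TRUE
)

print(rf$confusion)    # out-of-bag confusion matrix with per-class error
print(importance(rf))  # variable importance table
varImpPlot(rf)         # same information as a plot
```

With balanced per-tree samples the rare "close" class is no longer drowned out, so the per-class error in the confusion matrix is a more honest guide than overall accuracy.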

My guess is that #2 is where you want to spend your time, though it is also the hardest and requires the most thought.
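Suggestions 3 and 5 can be combined into a quick diagnostic. The sketch below again uses synthetic data with hypothetical features x1 and x2, and it assumes the log is applied to the response (the distance itself), which is only one reading of "distance-related variables".

```r
library(randomForest)
library(ggplot2)

set.seed(1)

# Synthetic stand-in data; x1 and x2 are hypothetical features.
n  <- 1000
df <- data.frame(x1 = runif(n, 0, 3), x2 = rnorm(n))
df$distance     <- exp(0.5 + df$x1 + rnorm(n, sd = 0.3))
df$log_distance <- log(df$distance)

# Hold out 30% of the rows for evaluation.
test_idx <- sample(n, size = round(0.3 * n))
train <- df[-test_idx, ]
test  <- df[test_idx, ]

# Fit on the log scale so small distances are not swamped by large ones.
rf <- randomForest(log_distance ~ x1 + x2, data = train, ntree = 200)

# Back-transform the predictions and compute held-out errors.
test$pred  <- exp(predict(rf, newdata = test))
test$error <- test$pred - test$distance

# Error against one feature, coloured by the true distance.
p <- ggplot(test, aes(x = x1, y = error, colour = distance)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed")
print(p)
```

Repeating the plot for each feature shows where the errors concentrate, which can in turn hint at which new features might be worth building.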