I am working with a dataset of roughly 1.5 million observations. I am finding that running a regression tree (I am using the mob()
* function from the party
package) on more than a small subset of my data is taking extremely long (I can't run on a subset of more than 50k obs).
I can think of two main problems that are slowing down the calculation
Does anyone have suggestions on either alternative tree implementations that work better for large datasets or for things I could change to make the calculation go faster**?
* I am using mob()
, since I want to fit a linear regression at the bottom of each node, to split up the data based on their response to the treatment variable.
** One thing that seems to be slowing down the calculation a lot is that I have a factor variable with 16 types. Calculating which subset of the variable to split on seems to take much longer than other splits (since there are so many different ways to group them). This variable is one that we believe to be important, so I am reluctant to drop it altogether. Is there a recommended way to group the types into a smaller number of values before putting it into the tree model?
My response comes from a class I took that used these slides (see slide 20).
The statement there is that there is no easy way to deal with categorical predictors with a large number of categories. Also, I know that decision trees and random forests will automatically prefer to split on categorical predictors with a large number of categories.
A few recommended solutions:
ordered factor
in R
randomForest
package is to set the randomForest
parameter mtry
to a lower number. This controls the number of variables that the algorithm looks through for each split. When it's set lower you'll have fewer instances of your categorical predictor appear vs. the rest of the variables. This will speed up estimation times, and allow the advantage of decorrelation from the randomForest
method ensure you don't overfit your categorical variable.Finally, I'd recommend looking at the MARS or PRIM methods. My professor has some slides on that here. I know that PRIM is known for being low in computational requirement.