Tags: r, parallel-processing, regression, cart-analysis, large-data

Running regression tree on large dataset in R


I am working with a dataset of roughly 1.5 million observations. Running a regression tree (I am using the mob()* function from the party package) on more than a small subset of the data takes extremely long; I can't run it on a subset of more than 50k observations.

I can think of two main problems that are slowing down the calculation:

  1. The splits are being calculated at each step using the whole dataset. I would be happy with results that chose the variable to split on at each node based on a random subset of the data, as long as the sample is replenished to full size at each subnode in the tree.
  2. The operation is not being parallelized. It seems to me that as soon as the tree has made its first split, it ought to be able to use two processors, so that by the time there are 16 splits each of the processors in my machine would be in use. In practice, it seems that only one is being used.

Does anyone have suggestions, either for alternative tree implementations that work better on large datasets or for things I could change to make the calculation go faster**?

* I am using mob(), since I want to fit a linear regression at the bottom of each node, to split up the data based on their response to the treatment variable.

** One thing that seems to be slowing down the calculation a lot is that I have a factor variable with 16 types. Calculating which subset of the variable to split on seems to take much longer than other splits (since there are so many different ways to group them). This variable is one that we believe to be important, so I am reluctant to drop it altogether. Is there a recommended way to group the types into a smaller number of values before putting it into the tree model?


Solution

  • My response comes from a class I took that used these slides (see slide 20).

    The statement there is that there is no easy way to deal with categorical predictors that have a large number of categories. Also, as far as I know, decision trees and random forests tend to favor splits on categorical predictors with many categories.
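    To see why a 16-level factor is expensive, it helps to count the candidate splits. The sketch below uses the standard count of ways to divide k unordered levels into two non-empty groups. Whether a given tree implementation actually enumerates all of them is an assumption here, and the helper names are made up for illustration.

```r
# Number of distinct binary partitions of k unordered levels into two
# non-empty groups: 2^(k - 1) - 1.
n_binary_splits <- function(k) 2^(k - 1) - 1

n_binary_splits(4)    # 7
n_binary_splits(16)   # 32767

# An ordered factor (or a numeric variable) with k distinct values has only
# k - 1 candidate cut points.
n_ordered_splits <- function(k) k - 1

n_ordered_splits(16)  # 15
```

    So a search over an unordered 16-level factor may have to consider tens of thousands of groupings per node, versus 15 cut points once the levels are ordered.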

    A few recommended solutions:

    • Bin your categorical predictor into fewer bins (that are still meaningful to you).
    • Order the predictor's levels according to their mean response (slide 20). This is my professor's recommendation. In R, that would lead me to use an ordered factor.
    • Be careful about the influence of this categorical predictor. For example, with the randomForest package you can set the mtry parameter to a lower number. This controls the number of variables the algorithm considers at each split. With a lower value, your categorical predictor will be a split candidate less often relative to the other variables. This speeds up estimation, and the decorrelation that the random forest method provides helps ensure you don't overfit to the categorical variable.
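    The "order by means" suggestion can be sketched in base R roughly as follows. The data are simulated and the column names `y`, `type`, and `type_ord` are hypothetical stand-ins for your own variables; this is a minimal illustration, not a tested recipe for mob().

```r
# Simulated stand-in data: a 16-level factor 'type' and a numeric response 'y'.
set.seed(1)
n <- 5000
type <- factor(sample(LETTERS[1:16], n, replace = TRUE))
y <- rnorm(n, mean = as.integer(type) %% 5)
dat <- data.frame(y = y, type = type)

# Mean response per level, sorted in ascending order.
level_means <- sort(tapply(dat$y, dat$type, mean))

# Recode as an ordered factor whose level order follows those means.
# The observations keep their labels; only the level ordering changes.
dat$type_ord <- factor(dat$type, levels = names(level_means), ordered = TRUE)

# Inspect the resulting level order.
levels(dat$type_ord)
```

    You would then use `type_ord` in place of `type` as the partitioning variable. Since you care about the response to a treatment rather than the raw mean, ordering the levels by a per-level treatment effect might suit your case better, but that is my own guess and not something from the slides.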

    Beyond that, I'd recommend looking at the MARS or PRIM methods. My professor has some slides on that here. As far as I know, PRIM has low computational requirements.