Search code examples
rtreestatisticsrpart

rpart: Computational time for categorical vs continuous regressors


i am currently using the rpart package to fit a regression tree to a data with relatively few observations and several thousand categorical predictors taking two possible values.

from testing the package out on smaller data i know that in this instance it doesn't matter whether i declare the regressors as categorical (i.e. factors) or leave them as they are (they are coded as +/-1).

however, i would still like to understand why passing my explanatory variables as factors significantly slows the algorithm down (not least because i shall soon get new data where response takes 3 diffirent values and treating them as continuous would no longer be an option). surely it should be the other way round?

here is a sample code emulating my data:

library(rpart)

x <- as.data.frame(matrix(sample(c(-1, +1), 50 * 3000, replace = T), nrow = 50))
y <- rnorm(50)

x.fac <- as.data.frame(lapply(x, factor))

now compare:

system.time(rpart( y ~ ., data = x, method = 'anova'))

   user  system elapsed 
   1.62    0.21    1.85 

system.time(rpart( y ~ ., data = x.fac, method = 'anova'))

   user  system elapsed 
   246.87  165.91  412.92 

dealing with only one potential split possibility per variable (factors) is simpler and faster than dealing with a whole range of potential splits (for continuous variables), so i am most confused by the rpart behaviour. any clarifications/suggestions would be very apprecaited.


Solution

  • You need to profile the code to be sure, but I would be surprised if the timing difference did not come from R having to turn each factor variable into two binary variables as it prepares the model matrix.

    Try

    Rprof("rpartProfile.Rprof")
    rpart( y ~ ., data = x.fac, method = 'anova')
    Rprof()
    
    summaryRprof("rpartProfile.Rprof")
    

    and look to see where the time is being spent. Which I have now done:

    > summaryRprof("rpartProfile.Rprof")
    $by.self
                              self.time self.pct total.time total.pct
    "[[<-.data.frame"            786.46    72.45     786.56     72.46
    "rpart.matrix"               294.26    27.11    1081.78     99.66
    "model.frame.default"          1.04     0.10       3.00      0.28
    "terms.formula"                0.96     0.09       0.96      0.09
    "as.list.data.frame"           0.46     0.04       0.46      0.04
    "makepredictcall.default"      0.46     0.04       0.46      0.04
    "rpart"                        0.44     0.04    1085.38     99.99
    "[[.data.frame"                0.16     0.01       0.42      0.04
    "<Anonymous>"                  0.16     0.01       0.18      0.02
    "match"                        0.14     0.01       0.22      0.02
    "print"                        0.12     0.01       0.12      0.01
    "model.matrix.default"         0.10     0.01       0.44      0.04
    ....
    
    $by.total
                              total.time total.pct self.time self.pct
    "rpart"                      1085.38     99.99      0.44     0.04
    "rpart.matrix"               1081.78     99.66    294.26    27.11
    "[[<-"                        786.62     72.47      0.06     0.01
    "[[<-.data.frame"             786.56     72.46    786.46    72.45
    "model.frame.default"           3.00      0.28      1.04     0.10
    "eval"                          3.00      0.28      0.04     0.00
    "eval.parent"                   3.00      0.28      0.00     0.00
    "model.frame"                   3.00      0.28      0.00     0.00
    "terms.formula"                 0.96      0.09      0.96     0.09
    "terms"                         0.96      0.09      0.00     0.00
    "makepredictcall"               0.50      0.05      0.04     0.00
    "as.list.data.frame"            0.46      0.04      0.46     0.04
    "makepredictcall.default"       0.46      0.04      0.46     0.04
    "as.list"                       0.46      0.04      0.00     0.00
    "vapply"                        0.46      0.04      0.00     0.00
    "model.matrix.default"          0.44      0.04      0.10     0.01
    "[["                            0.44      0.04      0.02     0.00
    "model.matrix"                  0.44      0.04      0.00     0.00
    ....
    
    $sample.interval
    [1] 0.02
    
    $sampling.time
    [1] 1085.5
    

    Note from the above that a big chunk of time is spent in function rpart.matrix:

    > rpart:::rpart.matrix
    function (frame) 
    {
        if (!inherits(frame, "data.frame") || is.null(attr(frame, 
            "terms"))) 
            return(as.matrix(frame))
        for (i in 1:ncol(frame)) {
            if (is.character(frame[[i]])) 
                frame[[i]] <- as.numeric(factor(frame[[i]]))
            else if (!is.numeric(frame[[i]])) 
                frame[[i]] <- as.numeric(frame[[i]])
        }
        X <- model.matrix(attr(frame, "terms"), frame)[, -1L, drop = FALSE]
        colnames(X) <- sub("^`(.*)`", "\\1", colnames(X))
        class(X) <- c("rpart.matrix", class(X))
        X
    }
    

    But it is the for loop in that function where most of the time is spent, essentially converting each column and adding them back to the data frame.