i am currently using the rpart
package to fit a regression tree to a data with relatively few observations and several thousand categorical predictors taking two possible values.
from testing the package out on smaller data i know that in this instance it doesn't matter whether i declare the regressors as categorical (i.e. factors) or leave them as they are (they are coded as +/-1).
however, i would still like to understand why passing my explanatory variables as factors significantly slows the algorithm down (not least because i shall soon get new data where response takes 3 diffirent values and treating them as continuous would no longer be an option). surely it should be the other way round?
here is a sample code emulating my data:
library(rpart)
x <- as.data.frame(matrix(sample(c(-1, +1), 50 * 3000, replace = T), nrow = 50))
y <- rnorm(50)
x.fac <- as.data.frame(lapply(x, factor))
now compare:
system.time(rpart( y ~ ., data = x, method = 'anova'))
user system elapsed
1.62 0.21 1.85
system.time(rpart( y ~ ., data = x.fac, method = 'anova'))
user system elapsed
246.87 165.91 412.92
dealing with only one potential split possibility per variable (factors) is simpler and faster than dealing with a whole range of potential splits (for continuous variables), so i am most confused by the rpart
behaviour. any clarifications/suggestions would be very apprecaited.
You need to profile the code to be sure, but I would be surprised if the timing difference did not come from R having to turn each factor variable into two binary variables as it prepares the model matrix.
Try
Rprof("rpartProfile.Rprof")
rpart( y ~ ., data = x.fac, method = 'anova')
Rprof()
summaryRprof("rpartProfile.Rprof")
and look to see where the time is being spent. Which I have now done:
> summaryRprof("rpartProfile.Rprof")
$by.self
self.time self.pct total.time total.pct
"[[<-.data.frame" 786.46 72.45 786.56 72.46
"rpart.matrix" 294.26 27.11 1081.78 99.66
"model.frame.default" 1.04 0.10 3.00 0.28
"terms.formula" 0.96 0.09 0.96 0.09
"as.list.data.frame" 0.46 0.04 0.46 0.04
"makepredictcall.default" 0.46 0.04 0.46 0.04
"rpart" 0.44 0.04 1085.38 99.99
"[[.data.frame" 0.16 0.01 0.42 0.04
"<Anonymous>" 0.16 0.01 0.18 0.02
"match" 0.14 0.01 0.22 0.02
"print" 0.12 0.01 0.12 0.01
"model.matrix.default" 0.10 0.01 0.44 0.04
....
$by.total
total.time total.pct self.time self.pct
"rpart" 1085.38 99.99 0.44 0.04
"rpart.matrix" 1081.78 99.66 294.26 27.11
"[[<-" 786.62 72.47 0.06 0.01
"[[<-.data.frame" 786.56 72.46 786.46 72.45
"model.frame.default" 3.00 0.28 1.04 0.10
"eval" 3.00 0.28 0.04 0.00
"eval.parent" 3.00 0.28 0.00 0.00
"model.frame" 3.00 0.28 0.00 0.00
"terms.formula" 0.96 0.09 0.96 0.09
"terms" 0.96 0.09 0.00 0.00
"makepredictcall" 0.50 0.05 0.04 0.00
"as.list.data.frame" 0.46 0.04 0.46 0.04
"makepredictcall.default" 0.46 0.04 0.46 0.04
"as.list" 0.46 0.04 0.00 0.00
"vapply" 0.46 0.04 0.00 0.00
"model.matrix.default" 0.44 0.04 0.10 0.01
"[[" 0.44 0.04 0.02 0.00
"model.matrix" 0.44 0.04 0.00 0.00
....
$sample.interval
[1] 0.02
$sampling.time
[1] 1085.5
Note from the above that a big chunk of time is spent in function rpart.matrix
:
> rpart:::rpart.matrix
function (frame)
{
if (!inherits(frame, "data.frame") || is.null(attr(frame,
"terms")))
return(as.matrix(frame))
for (i in 1:ncol(frame)) {
if (is.character(frame[[i]]))
frame[[i]] <- as.numeric(factor(frame[[i]]))
else if (!is.numeric(frame[[i]]))
frame[[i]] <- as.numeric(frame[[i]])
}
X <- model.matrix(attr(frame, "terms"), frame)[, -1L, drop = FALSE]
colnames(X) <- sub("^`(.*)`", "\\1", colnames(X))
class(X) <- c("rpart.matrix", class(X))
X
}
But it is the for
loop in that function where most of the time is spent, essentially converting each column and adding them back to the data frame.