r, random-forest, r-caret, r-ranger

Making caret train rf faster when ranger is not an option


The website where I need to run my code uses an old version of R and does not accept ranger as a library, so I have to use the caret package. My training data frame has about 800,000 rows, and this is the code I use:

control <- trainControl(method = 'repeatedcv',
                        number = 3,
                        repeats = 1,
                        search = 'grid')

tunegrid <- expand.grid(.mtry = c(sqrt(ncol(train_1))))

fit <- train(value~.,
             data = train_1,
             method = 'rf',
             ntree = 73,
             tuneGrid = tunegrid,
             trControl = control)

After looking at previous posts, I tried tuning my control parameters. Is there any way to make the model run faster? Can I specify settings so that caret fits a single model with the parameters I set, rather than trying multiple options?

This is my ranger code, which I have already optimized and which currently gives an accurate model:

fit <- ranger(value ~ .,
              data = train_1,
              num.trees = 73,
              max.depth = 35,
              mtry = 7,
              importance = 'impurity',
              splitrule = "extratrees")

Thank you so much for your time


Solution

  • When you specify method='rf', caret uses the randomForest package to build the model. If you don't need the cross-validation that caret provides, just build your model with the randomForest package directly. For example:

    library(randomForest)
    fit <- randomForest(value ~ ., data=train_1)
    

    You can pass values for ntree, mtry, and other arguments directly to randomForest().
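
    As a rough sketch of what that could look like, the call below sets ntree and mtry to the values from the question and uses the x/y interface, which the randomForest documentation recommends over the formula interface for large data. The synthetic train_1 is a hypothetical stand-in for the real data, and the nodesize and sampsize values are illustrative assumptions, not settings taken from the question.

    ```r
    library(randomForest)

    # Hypothetical stand-in for train_1: a numeric outcome `value` plus 9 predictors.
    set.seed(42)
    train_1 <- data.frame(value = rnorm(500), matrix(rnorm(500 * 9), ncol = 9))

    # Separate predictors and outcome so the formula interface is not needed.
    x <- train_1[, setdiff(names(train_1), "value")]
    y <- train_1$value

    fit <- randomForest(
      x = x, y = y,
      ntree    = 73,   # number of trees, as in the question
      mtry     = 7,    # predictors tried at each split (must not exceed ncol(x))
      nodesize = 50,   # assumed value: larger terminal nodes mean smaller trees
      sampsize = 200   # assumed value: rows drawn for each tree
    )
    ```

    Growing each tree on a subsample (sampsize) and stopping splits earlier (nodesize) usually shortens training time, at some possible cost in accuracy, so both values would need checking against the real data.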

    Note that the randomForest package is slow (or just won't work) for large datasets. If ranger is unavailable, have you tried the Rborist package?
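
If the model has to go through caret's train() anyway, one option is to switch resampling off. With 3-fold cross-validation and a single mtry value, caret fits three resampling models plus the final one; trainControl(method = "none") fits only the final model, and it requires a tuning grid with exactly one row. The sketch below assumes a synthetic train_1 as a hypothetical stand-in for the real data.

```r
library(caret)

# Hypothetical stand-in for train_1: a numeric outcome `value` plus 9 predictors.
set.seed(42)
train_1 <- data.frame(value = rnorm(300), matrix(rnorm(300 * 9), ncol = 9))

control  <- trainControl(method = "none")  # no resampling: fit the final model only
tunegrid <- data.frame(mtry = 7)           # exactly one candidate, so nothing is tuned

fit <- train(value ~ .,
             data = train_1,
             method = "rf",
             ntree = 73,                   # passed through to randomForest()
             tuneGrid = tunegrid,
             trControl = control)
```

Without resampling, caret reports no performance estimate, so accuracy has to be checked on a separate hold-out set.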