r, random-forest, r-ranger

Access which number of trees had the lowest error when running random forest


I am following this example and I want to change one part of the code from:

# default RF model
m1 <- randomForest(
  formula = Sale_Price ~ .,
  data    = ames_train
)

# number of trees with lowest MSE
btree <- which.min(m1$mse)

to its equivalent ranger-based code. The issue is that ranger doesn't provide direct access to the number of trees with the lowest MSE. How can I calculate the number of trees with the lowest MSE and store it in a variable (which I call btree)?

library(rsample)      # data splitting 
library(randomForest) # basic implementation
library(ranger)       # a faster implementation of randomForest

set.seed(123)
ames_split <- initial_split(AmesHousing::make_ames(), prop = .7)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

# for reproducibility
set.seed(123)

# default RF model
m1 <- randomForest(
  formula = Sale_Price ~ .,
  data    = ames_train
)

# the equivalent in ranger
m1 <- ranger(
      formula = Sale_Price ~ .,
      data    = ames_train
    )

# number of trees with lowest MSE (randomForest package)
btree <- which.min(m1$mse)

Based on the ranger documentation:

prediction.error: Overall out-of-bag prediction error. For classification this is accuracy (proportion of misclassified observations), for probability estimation the Brier score, for regression the mean squared error and for survival one minus Harrell's C-index.

So if I do:

  m1 <- ranger(
    formula = Sale_Price ~ .,
    data    = ames_train
  )
  
  # number of trees with highest r2
  btree = which.max(m1$prediction.error)
  print(btree)

The result is:

[1] 1

which is obviously not right.


Solution

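  • I don't think there is a way to get this directly from the ranger output: prediction.error is a single number for the whole forest, not a per-tree vector like m1$mse in randomForest, which is why which.max() on it always returns 1. But you can predict with only the first i trees, for each i, and calculate the error yourself. Note that the code below predicts on the training data, so it measures in-sample error rather than the out-of-bag error that randomForest reports in m1$mse. For example: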

    m1 <- ranger(
      formula = Sale_Price ~ .,
      data    = ames_train,
      keep.inbag = TRUE, 
      write.forest = TRUE 
    )
    
    num_trees <- m1$num.trees
    mse <- numeric(num_trees)
    
    # MSE of the forest truncated to its first i trees, evaluated on the
    # training data (in-sample error, not out-of-bag error)
    for (i in seq_len(num_trees)) {
      pred <- predict(m1, 
                      data = ames_train, 
                      num.trees = i)$predictions
      mse[i] <- mean((pred - ames_train$Sale_Price)^2)
    }
    
    btree <- which.min(mse)
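
If you want something closer to what randomForest stores in m1$mse, the error should be computed only from trees for which a row was out-of-bag. The following is a rough sketch of that idea, not an official ranger feature: oob_mse_by_trees is a hypothetical helper name, and mtcars stands in for the Ames data so the block runs on its own. It assumes the forest was fitted with keep.inbag = TRUE, that predict(..., predict.all = TRUE) returns one column of predictions per tree, and that the rows passed in are in the same order as the training rows.

```r
library(ranger)

# Hypothetical helper (not part of ranger): out-of-bag MSE of the first k
# trees, for k = 1, ..., num.trees. Requires a fit made with keep.inbag = TRUE.
oob_mse_by_trees <- function(fit, data, y) {
  stopifnot(!is.null(fit$inbag.counts))

  # n x num.trees matrix: prediction of every single tree for every row
  per_tree <- predict(fit, data = data, predict.all = TRUE)$predictions

  # n x num.trees logical matrix: TRUE where the row was NOT used to grow the tree
  oob <- do.call(cbind, fit$inbag.counts) == 0

  # Zero out in-bag predictions so they do not contribute to the running sums
  per_tree[!oob] <- 0

  # Per row: running sum of OOB predictions and running count of OOB trees
  cum_sum <- t(apply(per_tree, 1, cumsum))
  cum_n   <- t(apply(oob, 1, cumsum))

  # Running OOB prediction; NaN while a row has not been out-of-bag yet
  oob_pred <- cum_sum / cum_n

  # MSE per forest size, ignoring rows without an OOB prediction so far
  colMeans((oob_pred - y)^2, na.rm = TRUE)
}

set.seed(123)
fit <- ranger(mpg ~ ., data = mtcars, num.trees = 200, keep.inbag = TRUE)

mse_oob <- oob_mse_by_trees(fit, mtcars, mtcars$mpg)
btree   <- which.min(mse_oob)  # number of trees with the lowest OOB MSE
```

With the data from the question, the call would be btree <- which.min(oob_mse_by_trees(m1, ames_train, ames_train$Sale_Price)), using the m1 fitted with keep.inbag = TRUE above. The last element of the returned vector should be close to m1$prediction.error, which is a quick sanity check that the out-of-bag bookkeeping is right.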